Building Bayesian models of trajectories in dynamical physical systems using PyMC3

This article explores the ability of Bayesian models of trajectories to learn the properties of linear time-dependent physical processes from observed data, using Normal, Student-T and Gaussian Random Walk priors. These models are tested against simulated noisy measurements and observations, with simulated outliers and missing observations, and are evaluated for their predictive power to generate trajectories beyond the sample domain.

Bayesian Inference

Overview

Many physical systems can be described in terms of trajectories through a vector space. The notion of a trajectory is important as it implies causality and sequence and captures the expectation that an observation at point $t$ is in some way related to a prior observation at $t−1$. This allows us to ascribe smoothness and single-valued continuity to our observational data, which is the basis of linear regression and non-parametric methods, and gives us the ability to infer predictive linear$^*$ models from sample data and hence extrapolate into unobserved regions. ([6] Léon Bottou "Nature doesn’t shuffle… so maybe we shouldn’t": Learning Representations Using Causal Invariance)

Trajectories are often described in the time dimension but in general any coordinate where the model state at some index $i$ can be computed from a previous state $i−1$ implies a trajectory. ([7] Parameter Trajectory Analysis to Identify Treatment Effects of Pharmacological Interventions).

For earlier work see The Power of Bayesian Inference estimated using PyMC3, and valuable material from Thomas Wiecki https://discourse.pymc.io/u/twiecki/summary, Junpeng Lao https://discourse.pymc.io/u/junpenglao/summary, https://discourse.pymc.io/t/oos-predictions-with-missing-input-values/423 and Juan Camilo Orduz https://juanitorduz.github.io/about/. Also ref [8] Gelman, A default prior distribution for logistic and other regression models.

Roadmap

  1. General discussion of trajectories, representation and visualization
  2. Example trajectory scenarios: Linear, Polynomial, Nonlinear
  3. Scenario setup: Python code defaults
  4. Scenario definitions
  5. Visualizing example trajectory scenarios: Matplotlib helpers and 2D visualizations
  6. Experiments: building model priors and simulated data
  7. Visualization of the test scenario: helpers and 3D visualizations
  8. Bayesian inference learning models: compute PyMC3 regression, Normal, Student-T, Gaussian Random Walk, sample data runs, with and without outliers
  9. Results: posterior predictions and generated trajectory comparisons
  10. Cumulative test results: RMSE summaries of trajectory tests

1 Trajectories

We construct trajectories of $x$ through $t$ with observed $z=f(x,t)$. A trajectory is represented as an ordered sequence of equidistant steps in $t$, indexed by an integer $i$ with fixed step size $\delta_t$, and a path defined by the coordinate $x$ at each step $t_i$. Hence our trajectory can be written as $z=f(x(t),t)$: $z$ has a direct dependency on $t$, combined with an indirect dependency through the dependency of $x$ on $t$. We assume that $z$ and $t$ are scalars. In general $x$ could be multivariate, but here we consider the one-dimensional case only.

Below we build multiple trajectories through the vector space from which we attempt to learn the parameters of the mathematical linear model $f(x,t)$. If the model parameters are represented as priors $θ$, the task is a Bayesian regression of $z=f(θ,x,t)$ conditioned on the observations $z_{obs}$ over $x$ and $t$.

1.1 Specification of trajectories

If the equation $z = f(x,t)$ describes a smooth surface in the $[x,t]$ vector space, a trajectory is a one-dimensional line across this surface specified by the path $x(t)$. This equation describes the time evolution of the dynamical physical system we are considering and captures some causal process derived from the physics of the system. In the Feynman formulation of quantum mechanics a system traverses many paths simultaneously, each contributing a probability amplitude ([4] Quantum Mechanics and Path Integrals).

We can consider three degrees of complexity in the trajectory paths $x$ over $t$:

  1. Linear model In the simple case, for each trajectory, $x$ has an initial value which does not vary with $t$. Then each trajectory, index $n$, can be written as $z_n = f(x_n, t)$, where $x_n$ is a constant for that trajectory. For example in streamlined linear flow, the depth of a flow, $x_n$, as it passes through a barrier may remain unchanged from the initial depth $x_n(t_0)$, with the downstream velocity of the flow $v_n$ a function of the initial depth and time, $v_n = f(x_n(t_0), t)$.
  2. Polynomial model In the second case we consider that for each trajectory the observable $z_n(t)$ is a function of $x$ and $t$ as before, but the path $x_n(t)$ varies linearly over time. This leads to a polynomial type of dependency of $z$ on $t$. In the example above, if the flow passes over deformities in a riverbed, the depth $x$ may deviate from its initial value, with a corresponding change in the velocity $v_n(t) = f(x_n(t), t)$. The path of the trajectories can give a mapping of the riverbed topology. We must assume that the deviation of the flow $x$ does not depend on the velocity $v$; otherwise we need the non-linear model below.
  3. Nonlinear model In the case where the observed value $z$ varies with $x$ over $t$ and simultaneously the path coordinate $x$ of the trajectory changes with a dependency on $z$, we have to consider a non-linear model:

$z(t) = f(x(t), t)$

$x(t) = f(z(t), t)$

Simple Spring (Ref The Damped Harmonic Oscillator)

For example, for a simple spring with spring constant $k$ and a mass $m$, potential energy is a function of spring extension, and kinetic energy a function of the velocity of the mass. If we think of the mass's momentum $M$ as fixed, the mass will continue in uniform motion with momentum $M=m \dot{x}$. But the potential energy term resists this motion by transferring kinetic energy into potential energy as $x$ increases. In the Lagrangian formulation this imposes a constraint on $x$ in terms of a rate of change of momentum $dM(t)/dt = -kx$. This is shown from the Lagrangian $L = \frac{1}{2}m\dot{x}^2 - \frac{1}{2}kx^2$, which combined with Euler-Lagrange minimisation [5], gives us:

[1.2.1.1] Generalised potential ${dM(t)}/{dt} ={\partial L}/{\partial x} = - kx(t)$

[1.2.1.2] Generalised momentum $M(t) = {\partial L}/{\partial \dot{x}} = m \dot{x}(t)$

2 Example trajectory scenarios

We construct sample trajectories based on a simple linear model $z = ax + bt$, observed at coordinates $(x, t)$. We allow $a$ and $b$ to vary linearly as a function of $x$: $a = a_0 + px$ and $b = b_0 + qx$. Expanding this out gives a usefully simple formulation for $z$ with which we can test our models:

$z = (a_0+px)x + (b_0+qx)t$

$z = a_0x+b_0t+px^2 + qxt$
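The factored and expanded forms can be checked against each other with a few lines of NumPy (a minimal sketch; the parameter values are illustrative):

```python
import numpy as np

def z_model(x, t, a0, b0, p, q):
    """Observable z = (a0 + p*x)*x + (b0 + q*x)*t, in expanded form."""
    return a0 * x + b0 * t + p * x**2 + q * x * t

# The factored and expanded forms agree on a small grid of (x, t) points.
x, t = np.meshgrid(np.linspace(-6, 6, 5), np.linspace(0, 10, 5))
a0, b0, p, q = 1.0, 0.5, 0.2, 0.1
factored = (a0 + p * x) * x + (b0 + q * x) * t
assert np.allclose(z_model(x, t, a0, b0, p, q), factored)
```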

In order to fully represent our trajectories we need to learn the parameters $a_0$, $b_0$, $p$, $q$.

We will test our model over a set of trajectories $n$, and we also need to model the path $x_n(t)$.

As described above we have three scenarios:

2.1 Linear model

$x(t) = x_n$, where $x_n$ is a constant for each trajectory $n$

2.2 Polynomial model

$x(t) = x_n(t_0) + v*t$, where $x_n(t_0)$ is an initial coordinate value and $v$ is a linear gradient of $x$ over $t$. The consequence of this variation in path $x$ over $t$ is that $z_n(t)$ is quadratic in $t$.

2.3 Nonlinear model - Mass $m$ on spring with spring constant $k$

$dz(t) = -kx(t)dt$ from [1.2.1.1]

$dx(t) = \frac{z(t)}{m}dt$ from [1.2.1.2]

We can construct each trajectory $n$ numerically with the following steps, at each iteration $i$ in $t$:

for each $i$ > 0 in $trajectory_n$

$dz = -k * x_n(i-1)*dt$

$z_n(i) = z_n(i-1) + dz$

$x_n(i) = x_n(i-1) + z_n(i)/m * dt$
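These steps can be implemented directly; a minimal sketch with illustrative values for $k$, $m$ and $dt$ (here $z$ plays the role of the momentum in [1.2.1.2]):

```python
import numpy as np

def spring_trajectory(x0, z0, k, m, dt, n_steps):
    """Integrate dz = -k*x*dt and dx = (z/m)*dt step by step.
    The x update uses the freshly updated z, as in the scheme above."""
    x = np.empty(n_steps)
    z = np.empty(n_steps)
    x[0], z[0] = x0, z0
    for i in range(1, n_steps):
        z[i] = z[i - 1] - k * x[i - 1] * dt   # dz = -k * x(i-1) * dt
        x[i] = x[i - 1] + z[i] / m * dt       # dx = z(i) / m * dt
    return x, z

x, z = spring_trajectory(x0=1.0, z0=0.0, k=1.0, m=1.0, dt=0.01, n_steps=2000)
```

Because the position update uses the already-updated momentum, this is the semi-implicit (symplectic) Euler scheme, which keeps the oscillation energy nearly constant over long runs.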

3 Scenario setup

Our input space is created on a two-dimensional mesh of $x$ and $t$. Each step in $x$ represents a new trajectory and each step in $t$ an iteration in the trajectory's time evolution. The observables $z$ are calculated from one of the three models specified above.

Model parameters are specified in code as slope = {$a_0, b_0$} and slope_bias = {$p, q$}, with other defaults as initialised below:
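Since the initialisation cell is not reproduced here, a hedged stand-in might look like the following (all values illustrative, not the notebook's actual defaults):

```python
import numpy as np

# Illustrative defaults for the scenario setup.
slope = {"a0": 1.0, "b0": 0.5}       # a_0, b_0
slope_bias = {"p": 0.2, "q": 0.1}    # p, q
n_trajectories = 13                  # steps in x, one trajectory per step
n_steps = 20                         # steps in t per trajectory

# 2-D mesh of x and t: rows index trajectories, columns index time steps.
x0 = np.linspace(-6, 6, n_trajectories)
t = np.arange(n_steps, dtype=float)
X, T = np.meshgrid(x0, t, indexing="ij")
```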

4 Scenario definitions

5 Visualizing example trajectory scenarios

For plotting we have used Matplotlib, with various helper functions defined below. Note that for 3D interactive visualization when running inside a Jupyter notebook, the setup %matplotlib inline should be replaced with %matplotlib widget.

2D Plot helper functions

5.1 Linear trajectory model

Here we visualize the simple linear model, with $x$ constant for each trajectory, for the case of $a$ and $b$ as fixed constants and for the case where $a$ and $b$ vary with $x$ by the parameters $p$ and $q$. Note that even though the surface is quadratic, each trajectory is linear.

5.2 Polynomial trajectory model

In this case we allow the path of $x(t)$ to vary linearly over $t$. $z$ is now polynomial in $x$ and $t$.

5.3 Nonlinear trajectory model

In this case we numerically compute the path $x(t)$ as described in 2.3 above, giving us the anticipated simple harmonic behaviour.

6 Experiments

Science proceeds from discovery and observation through to hypothesis, testing and prediction. Bayesian methods can tell us about the behaviour of underlying physical properties, and the relationships between them, in terms of priors and observed data, from which we can predict the likelihood of certain experimental outcomes. These predictions may lie within the space of the observed data (interpolation), beyond it (extrapolation), or in regions of the sample domain with no observations at all (prediction). In general prediction is much harder than interpolation or extrapolation, as we must make assumptions about regions of space which we have not observed. A useful theory will fit observations in the sampled space but will also allow us to make testable predictions beyond the scope of our current knowledge.

6.1 Model priors

This article explores in some detail the polynomial trajectory scenario described, 2.2, using Python with PyMC3 to construct and test three Bayesian models:

  1. Normal: Polynomial model with normal prior on observations
  2. Student-T: Polynomial model with Student-T prior on observations
  3. Gaussian Random Walk (GRW): Polynomial model combined with non-parametric Gaussian random walk prior

Our prior belief for path measurements $x$ is that they vary linearly with $t$ with gradient $v_{xt}$, and that observed $z=z_0+ax+bt$, where $a$ and $b$ are both allowed to vary as a function of $x$ with constant gradients $p$ and $q$: $a=a_0+px$, $b=b_0+qx$.

$x(t) = x_0 + v_{xt} * t$

$z(t) = (a_0+px)x + (b_0+qx)t = a_0x + b_0t + px^2 + qxt$

6.2 Testing experimental outcome

Given sample data generated from 'true' parameters, our objective in these experiments is to find a model which best infers the 'true' underlying sample data parameters $v_{xt}, a_0, b_0, p, q$ over a range of tests.

6.3 Noise in measurements and observations

We assume observations of $z$ at fixed time points $t_i$ over the predicted path $x(t)$. We allow that there is observational error in each modelled $z$, recorded as $z_{obs}(t_i)$, with prior mean $z(t)$. At each step in $t$ we also measure the coordinate $x_{obs}(t_i)$ conditioning the prior mean $x(t)$.

6.4 Outliers

T. Wiecki in his article [1] shows that a Student-T distribution is better than the Normal distribution at handling outliers in linear regression, as the Student-T distribution has longer tails than the Normal error prior. This conveys the notion that an outlier is not to be considered wholly unexpected.

The value of the GRW model is its ability to model local outlier behaviour while giving the inductive power of linear models to make predictions outside the sample space.

6.5 Missing data

A non-parametric model should always be able to model the observed data to any required degree of accuracy and so we hide a number of observations in order to test the model's interpolation of missing values.

The true test of the model is its ability to predict data that lies outside the sample space. This can be simulated by hiding some of the observed trajectories and then allowing the model to impute the values in these trajectories. This is much more challenging than imputing missing values that lie within the range of an observed trajectory, which is an interpolation task, or that lie beyond the extremity of a trajectory, which is an extrapolation task. We require that the model detect and apply inferred model parameters to predict along any new trajectory. To create this test we hide a given number of trajectories, keeping training_trajectories from which to learn the model parameters.

These tests are forms of the "Leave-One-Out (LOO)/Leave-Out-Batch" model evaluation techniques; ref [9] Aki Vehtari: Model assessment, selection and averaging, Aalto University.

6.6 Simulated data

We simulate real data by applying the following perturbations to the 'true' sample data:

  1. Measurement noise - each measurement $x$ has a Gaussian random noise element added, centred on the 'true' value $x(t)$
  2. Observational noise - each observation $z$ has a Gaussian random noise element added, centred on the 'true' value $z(t)$
  3. Anomalies/outliers - a set of points on each trajectory are displaced by a random offset with given mean and variance, putting those observations well beyond their anticipated values
  4. Missing data - a number of randomly chosen observations are flagged as 'missing', or unobserved, which gives the inference model the task of imputing these missing values
  5. Missing trajectories - a chosen number of trajectories are flagged as 'missing' giving the model the task of predicting out of sample trajectories from the model parameters

For test purposes we keep the 'true' data, from scenario 2.2, xt_true_2 and z_true_2, and our training data is calculated with and without _noise and with and without _missing_trajectories.
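The five perturbations can be sketched as one helper (noise scales, outlier counts and the helper name are illustrative choices, not the notebook's actual values):

```python
import numpy as np

rng = np.random.default_rng(42)

def perturb(x_true, z_true, noise_sd=0.5, n_outliers=2, n_missing=3,
            missing_trajectories=(11, 12)):
    """Apply measurement noise, observational noise, outliers and
    missing-data flags to the clean 'true' data."""
    x_obs = x_true + rng.normal(0.0, noise_sd, x_true.shape)  # 1. measurement noise
    z_obs = z_true + rng.normal(0.0, noise_sd, z_true.shape)  # 2. observational noise
    n_traj, n_steps = z_obs.shape
    for n in range(n_traj):                                   # 3. outliers per trajectory
        idx = rng.choice(n_steps, n_outliers, replace=False)
        z_obs[n, idx] += rng.choice([-1, 1], n_outliers) * rng.normal(5.0, 1.0, n_outliers)
    missing = np.zeros(z_obs.shape, dtype=bool)               # 4. missing observations
    missing.flat[rng.choice(z_obs.size, n_missing, replace=False)] = True
    missing[list(missing_trajectories), :] = True             # 5. missing trajectories
    return x_obs, np.ma.masked_array(z_obs, mask=missing)
```

Passing a masked array as observed data lets PyMC3 impute the masked entries automatically during sampling.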

6.7 Creating and visualizing simulated data

7 Visualization of the test scenario

In 2D projection it can be hard to visualize the sample space, measurements, trajectories and outliers. Here we adopt a wireframe and a contour plot to make the scenario a little clearer. Note that to animate Matplotlib charts within a Jupyter notebook, replace %matplotlib inline with %matplotlib widget at the top of this notebook.

As can be seen in the chart below, the effect of the linear model $z=ax+bt$, with variation in a and b parameters, $a=(a_0+p∗x)$, $b=(b_0+q∗x)$, and the path $x=x_0+v_{xt}t$ creates a parabolic noisy surface.

7.1 3D Plot helper functions

7.2 Cross correlation data view

This plot is a conventional view of the relationship between the variables $x$, $t$ and $z$.

8 Bayesian inference learning models

8.1 Non-parametric models

Non-parametric methods such as Gaussian processes (GP) have been shown to be effective in modelling the fine structure of observed data by relying on assumptions about the degree of local smoothness of the observations. It is reasonable to assume that most physical processes are smoothly changing, with certain time constants, and this property allows us to make good interpolation of observed data. However if we have inferred only the properties of local smoothness (not considering methods using periodic or non-local kernels), we have not modelled the long range behaviour of the parameters, and our non-parametric model will revert to the unconditioned statistical mean of our priors outside the sample space.

8.2 Linear and polynomial models

These models make the assumption that there is a simple polynomial relationship between the predictor variables and the observed outcome. Once the model parameters have been inferred from the observational data, it is straightforward to make out of sample predictions. The strength of the model is its simplicity in modelling out of sample data.

8.3 Gaussian Random Walk (GRW)

A random walk is a stochastic, iterative process which generates a single connected trajectory through an n-dimensional space. Each point on the trajectory is represented by its index relative to the previous point, together with an n-dimensional vector of coordinate values at that point. Successive points are random n-vectors drawn from a known multivariate probability distribution.
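A one-dimensional walk with drift takes only a few lines (a sketch; the drift and step scale are arbitrary):

```python
import numpy as np

def random_walk(n_steps, drift=0.1, sigma=1.0, seed=0):
    """Each step is a Normal draw centred on the previous point plus a drift constant."""
    rng = np.random.default_rng(seed)
    steps = drift + rng.normal(0.0, sigma, n_steps)
    return np.cumsum(steps)

path = random_walk(1000, drift=0.1, sigma=1.0)
```

Over many steps the drift term dominates: the expected value of path[i] grows as drift * (i + 1), while the random component grows only as sqrt(i).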

8.4 Model definitions

Below are the code definitions for the Normal, Student-T and GRW models which we are going to test.

In each case, we set up a prior on the $x_{obs}$ path variables for intercept and common gradient $v_{dxdt}$, so each trajectory has a path $x_n(t) = x_n(0) + v_{dxdt} * t$. $x_0$ and $v_{dxdt}$ have simple Normal priors with $\mu=0$ and $\sigma = 1$.

We also set up the $z$ target function for $z_{obs}$ with slope = {$a_0, b_0$} and slope_bias = {$p, q$}, with Normal priors $\mu=0$ and $\sigma = 20$: $z_n(t) = a_0x_n + b_0t + px_n^2 + qx_nt$, where $x_n$ is taken from the conditioned prior for $x$.

8.4.1 Normal Model

For the Normal model we condition the prior for $z$ using a simple Normal prior, with $\mu$ taken from the $z$ target function and the observational error prior $\sigma$ drawn from a Half-Cauchy distribution with $\beta = 1$.

8.4.2 Student-T Model

For the Student-T model we condition the prior for $z$ using a Student-T prior, with $\mu$ taken from the $z$ target function, the observational error prior $\sigma$ drawn from a Half-Cauchy distribution with $\beta = 1$, and degrees of freedom $\nu$ from a Gamma distribution with $\alpha=2$ and $\beta = 0.1$. Please refer to our previous article for further discussion.

8.4.3 Gaussian Random Walk Model

At each step the mean of the random vector draw is set to the value of the preceding point, such that successive points evolve along a connected path. A constant, known as 'drift', is added to every sample vector draw, having the effect of biasing the path in the observed data direction. We set the drift constants of the random walk path to be the gradient differences of the target function at each point along the path. This means that as we condition the random walk on the observed data, it will learn the values of the parameters that make up the drift, which are our target properties.

The GRW combines the properties of a non-parametric model, because it can easily track any path through the parameter space, with the ability to model a trend by learning the drift constants across the parameter space. The conditioning process samples many possible random walks along with a likelihood such that the mean random walk trajectory passes through the observations.

We again use a Student-T prior to condition the observed data against the sample random walk draws, with $\sigma = 0.2$ and degrees of freedom $\nu = 2$. If we did not have missing observations to deal with this would be sufficient to condition the random walk, but in the regions where we do not have observations the random walk can behave chaotically and the actual sample values diverge widely from the drift value, making the overall model ill-conditioned. To resolve this, we add a conditioning constraint that the sampled random walk should follow the target function $z$, with a Normal distribution at each point. Also, to make the random walk samples more stable, we set a lower and upper bound on the prior distribution to keep the sample draws within 2.0 times the min and max values of the observed data.

The result of conditioning both on the computed function $z$ and the observational data $z_{obs}$ is that we are able to obtain a solution that satisfies both the noisy sample data, where it exists, and also the learned model in the regions of unobserved data. This is a tug of war between these two distributions and it has to be carefully set up with priors and prior meta-parameters if it is to converge consistently in all the scenarios we have discussed. The model meta-parameters were manually configured after testing to ensure robustness, but this was a time-consuming and fragile process, and a proper regularization regime is required for ongoing work.

8.5 Model helpers

8.6 Setup results dataset

In this section we run each of the three models on the noisy data with all trajectories included for training but with some missing observations (the Inference model). We also run each model on the same data but with some trajectories entirely hidden from the model (the Predictive model), since it has to apply the learned model parameters to trajectories not represented in the sample data. The latter is a challenging task, particularly in the non-parametric GRW case, as it is a true test that the model has learned the correct parameters of the target function and can make out-of-sample predictions.

We have conducted various tests which show that it is challenging to get generalised models that perform well in multiple scenarios, with varying degrees of noise, outliers and missing data. The models are sensitive to the choice of priors and meta-parameters. We have tuned these models by hand, but clearly a robust regularization strategy would be preferred.

The run results sets are saved to a dictionary and tabulated at the end of this section.

Because we are using simulated data with added noise, we can calculate the RMSE of the simulated noise, with and without outliers, and include these in our summary figures for comparison purposes against our model outcomes.

8.7 Normal Model

8.7.1 Posterior Traceplot of Normal model with no missing trajectories

From the traceplot below, we can see the posterior distributions for each of the priors, with 4 overlapping lines representing results for each chain.

It can be seen that the estimates for the observed $x$ intercepts are very well defined and map to the true values, ranging from -6 to +6. The gradient $v$ in the path $x = x_0 + vt$ is also close to its true value of 0.42. This is encouraging as this prior is shared across all three models: Normal, Student-T and GRW.

The parameters for slope and slope_bias, which hold $a_0$, $b_0$, $p$ and $q$, have smooth distributions, with clean estimates across all samples, with good overlap for the 4 different chains.

Finally, the imputed missing observations which are interpolated into the trajectories, represented by o_missing, are all well resolved.

This gives us confidence that the model is behaving well.


pm.traceplot(trace_normal_lm)

8.8 Student-T Model

8.9 GRW Model

9 Results

Other than by visual inspection, the best way to compare the above models is to find an objective measure of how closely the inferred parameters reproduce the true data values. If for each model run we take the inferred parameters and use them to compute the data across a grid of $x$ and $t$ coordinates, we can compare the computed values against our original clean 'true' $z$ values and take the root mean square error (RMSE) across observed and unobserved data. RMSE is a good comparative measure of how closely different data sets match a reference data set, and its units are the original coordinate units.
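The comparison measure is straightforward to compute (a sketch):

```python
import numpy as np

def rmse(z_generated, z_true):
    """Root mean square error between generated values and the clean
    reference values, in the original units of z."""
    diff = np.asarray(z_generated) - np.asarray(z_true)
    return np.sqrt(np.mean(diff**2))
```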

We have tabulated below the learned parameter values and the RMSE for each model, trained with all trajectories as inputs (inference) and with some missing trajectories (predictive). We would expect the latter generally to be poorer than the former, since it has less training data and does not include all the data points; hence its RMSE should be larger.

We can see from the results that all the models gave RMSE values lower than the simulated data noise. This is a good indication that the models have converged to a good answer.

We can also see that the GRW model performs best, and the Student-T is somewhat better than the Normal model.

In order to get a better feel for these results the following visualizations show the posterior predictions for each trajectory along with their credibility intervals, charted against the observed data at each trajectory time point.

As can be seen the non-parametric GRW model follows the observed data very closely, which we expect. But to be sure we have obtained a good result we need to take the inferred model parameters, $a_0$, $b_0$, $p$, $q$ and use them to generate new sample observations across some space in $x$ and $t$, and compare these with 'true' sample data computed at the same points. We have done this in the final set of charts below. The overall impression is that we can closely model these trajectories using these Bayesian models.

10 Cumulative test results

Having developed the models as above, the chart below summarises the results of testing the model in various scenarios:

  1. Run the three models Normal, Student-T and Random Walk with 13 trajectories, systematically increasing the number of unobserved trajectories from zero to five. This effectively tests the predictive power of the models in unobserved regions of the sample space, with the 'training' trajectories decreasing from thirteen, the first being a fit with no missing trajectories, the last being a fit with eight training trajectories.
  2. We also run the same tests but decreasing the unobserved trajectories from the opposite end of the trajectory array, trajectory 13, removing in reverse order. This is to counter any artifacts in the shape of the trajectories or their relative positions.
  3. For each of these runs, we test the effect of removing the outliers.
  4. We also test removing the sample observation noise level from the case with all the trajectories, testing both with and without outliers. We have plotted this at each end of the chart below, duplicating the figures for comparison purposes. The case of zero trajectories removed is also duplicated on the left and right of the chart for comparison purposes.
  5. With fewer than eight trajectories observed out of thirteen, the model becomes poorly conditioned, we believe because then the observed data does not cover the point of inflection in the observations, which is effectively a saddle point in the model. In general, it is difficult to model a polynomial type inflection unless the observed data covers the transition through the change in gradient direction. In our case, the models then gave poor fits and singularities in the sampling.

10.1 Discussion of results

From the chart above we make the following observations:

  1. First of all it should be noted that for all three models the results fit the 'true' data closely. In all experiments the calculated models, with and without outliers, have lower RMSE than the original sample data. This is encouraging as it suggests that model refinement has obtained a fit that is relatively impervious to sample noise, whether random Gaussian noise or systematic noise due to outliers.
  2. As expected, the sample data without outliers allows a better fit to the 'true' data than the models generated from data with outliers. There is some indication that this difference becomes less as we enter the region where we have fewer training trajectories.
  3. For both cases, with and without outliers, the Random Walk Model is better in terms of RMSE than the Student-T model, which is better than the Normal model. In the case where we remove outliers the difference is small and all three models perform well.

10.2 Conclusion

The intention of this project was to test the viability of using Bayesian learning with PyMC3 to determine the parameters of physical systems via their observed trajectories across a surface in a vector space.

  1. We have been able to get good results for the inferred physical parameters across various scenarios and models.
  2. We have shown that the Gaussian random walk model seems to have some benefits in being less susceptible to systematic outliers in the sample data than the Normal and Student-T models.
  3. All three models were robust to various data sets, including sparse data, and converged quickly in real time.
  4. There were limits to the extent to which unobserved trajectories could be inferred from the training trajectories within the context of the model run. We do not believe this is a limitation since in practice once the physical properties have been learned from the observed data, these parameters can be used to predict out of sample data without needing the context of the PyMC3 model, simply by plugging them into the target mathematical model.
  5. The problem of regularizing the Gaussian random walk is a serious one, and if it is to be used as a generative distribution in future we need to automate the regularization of the meta-parameters.

We would like to continue this work and in particular extend it to the non-linear case, in the first instance to learn the properties of a simple harmonic oscillator from observed trajectories, rather than by solving the differential equations as can be done now.

References

  1. T. Wiecki, GLM: Robust Linear Regression (2018), PyMC3
  2. J. Salvatier, T. Wiecki, C. Fonnesbeck, Probabilistic programming in Python using PyMC3 (2016), PeerJ Computer Science
  3. J. Salvatier, C. Fonnesbeck, T. Wiecki, GLM: Robust Regression with Outlier Detection (2016), PyMC3
  4. R. P. Feynman and A. R. Hibbs, Quantum Mechanics and Path Integrals (1965), McGraw-Hill
  5. C. Torre Introduction. Configuration space. Equations of Motion. Velocity Phase Space (2010)
  6. Léon Bottou, Learning Representations Using Causal Invariance (2019), ICLR 2019, New Orleans
  7. Christian A. Tiemann, Joep Vanlier, Maaike H. Oosterveer, Albert K. Groen, Peter A. J. Hilbers, Natal A. W. van Riel Parameter Trajectory Analysis to Identify Treatment Effects of Pharmacological Interventions (2013) PLoS Comput Biol
  8. A. Gelman, A. Jakulin, M. Grazia Pittau, Yu-Sung Su, A default prior distribution for logistic and other regression models (2006) columbia.edu
  9. A. Vehtari: Model assessment, selection and averaging (2018) Aalto University https://users.aalto.fi/~ave/index.html

$^*$We have sometimes used the term linear loosely to describe any function that is a linear superposition of parameters; the justification is that higher-order terms such as $x^2$ can be replaced with an additional variable.